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Abstract. We consider the inverse reinforcement learning problem, that is, 
the problem of learning from, and then predicting or mimicking a controller 
based on state/action data. We propose a statistical model for such data, de- 
rived from the structure of a Markov decision process. Adopting a Bayesian 
approach to inference, we show how latent variables of the model can be esti- 
mated, and how predictions about actions can be made, in a unified framework. 
A new Markov chain Monte Carlo (MCMC) sampler is devised for simulation 
from the posterior distribution. This step includes a parameter expansion step, 
which is shown to be essential for good convergence properties of the MCMC 
sampler. As an illustration, the method is applied to learning a human con- 
troller. 



1. Introduction 

1.1. Motivation. The problem of fitting a statistical model to observed actions 
has received si gnificant att ention in a va riety of disciplines . These include Op- 



timal Control (IRustl. [l988h Economics (iGotz and McCall [l980; W olpinl 



Rustl. Il987t iHotz and Millerl. Il993t IGeweke and Keand . 1200a iGeweke et"aT 



1984; 



1994: 



Aguirregabiria and Mira L 2002 ; Imai et al. , 20091) . and Machine Learning ( Ng and Russell , 



2000 ; Abbeel and Net 2004 ) . Across these cases there is some variety in the estima- 
tion aims and the assumed mechanisms which generate observed actions. We focus 
on the case in which it is assumed that the observations arise from an underlying 
Markov Decision Process (a full model specification is given in the next section). In 
this case, given the current state X of the system, the controller chooses an action 
A and receives an instantaneous reward specified by the function r, 

(X,A) -> r(X, A) G R 

where (X, A) is the state-action pair. The state then evolves according to a Markov 
kernel p(-\X,A), the controller chooses the next action, receives a reward and so 
on. The controller chooses its actions to maximize the average reward it will accrue 
over an infinite horizon. However, the controller may take sub-optimal decisions 
from time to time. This is captured by a "noisy" Markov Decision Process (MDP) 
model. 

A generic approach to automating a task is to mode l it as a control p r oblem, see 
Chapt er 8 of Bertsekas and Tsitsiklisl ( 1996t ). and also Bertsekasl ( 2005 ). Bertsekas 
(|2OO70 . which involves specifying a reward function and other elements of the model, 
and then solving for the optimal controller. Putting aside the difficulties associated 
with the last step, specifying the reward function is non-trivial and often achieved 
in practice by a heuristic process of trial and error, i.e. by observing the system 
under the computed optimal controller and then adjusting the reward function to 
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avoid observed undesired behavior. After this adjustment, the optimal controller is 
re-computed and this process is repeated until satisfaction. 

An alternative, often simpler approach is to obtain a sample path which is char- 
acteristic of desired behavior, e.g. by a human controller, and then estimate the 
control policy generatin g this sample path: this is known as the inv erse reinforce- 
ment learning problem (|Ng and Russell 120001 : lAbbeel and Ngl . [2004). Learning to 
mimic a controller has many potential applicatio ns in applied fields such as robot- 
ics and a rtificial in t ellige n ce dCoates et al. , 20091); biolog y, e.g. the st udy of anima l 
learning ()Watkinsl . 1987 ; Schmaiuk and Zanuttol . 19971) . economics ( Rust . 1987 ). 
and other fields. 

Our aim is to develop a purely statistical, and computationally tractable, solu- 
tion to this problem of mimicking behaviour based on a statistical model for the 
controller's actions, and the Bayesian approach. This is advantageous because it 
gives us a unified and principled framework for the following tasks: (a) to properly 
model uncertainty (i.e. the human controller may make mistakes); (b) to estimate 
jointly the control policy and the model parameters, (c) to predict future actions. 

In a parametric approach to estimation, the reward function and other elements 
of the model are assumed to be specific functiona l forms of a parameter vector 9 
( Gotz and McCalll Il980l: IWolpinl 11984 iRusti. Il98 7l Il988l: IXg uirregabiria and Mira . 



20021 iHotz and Millerl . Il993t llmai etal1 T2009) . The best parametric estimate may 



then be computed, for example, by maximizing the likelihood of the observed data 
with respect to 0. From a computational perspective, we shall see that this approach 
is cumbersome for a noisy MDP model because the likelihood of an observed action 
given observed states is an intractable integral. We avoid this difficulty by targeting 
the control policy directly; as we shall see, a specific quantity called the optimal 
value function. This gives an additional justificat ion to the Bayesian ap proach in 
this context, as the data augmentation principle ( Tanner and Wongl 1987 ) makes it 
possible to estimate the model without computing the difficult integral mentioned. 



1.2. Contributions. The contributions of this work are as follows. We adopt a 
Bayesian approach to modelling state/action data derived from the structure of an 
MDP. Our approach is inspired b y the pioneering work of I Albert and Chibl (jl993l ) 
and McCulloch and Rossi (1994) in the context of statistical inference in discrete 
choice models, where the computations are performed using a Gib bs samp l er ap - 



plie d to an enlarged (or aug mented) model. In subsequent work, iNobild (|19981 ) 
and Imai and van Dvkl ( 2005|) enhanced the compu tatio nal efficiency of the Gibbs 
sampling technique while IMcCulloch et al. (2000) and Imai and van Dvk ( 2005 ) 
devised new priors for the identified parameters of the model. 

We devise a new Gibbs sampling algorithm for inferring the optimal value func- 
tion. The propo sed algorithm is a Parameter Expanded Data Augmentation (PX- 
DA) algorithm (|Liu and Wul . Il999l iMeng and van Dvkl . Il999h . PX-DA improves 
upon the efficiency of standard DA by reducing the correlation between the sam- 
ples. This is achieved by inserting an additional simulation step into the algorithm 
which involves moving in the augmented data space. This extra simulation step is 
computationally inexpensive and leads to improved performance over standard DA 
algorithms. In fact, we give examples where the DA algorithm does not converge 
after a large number of iterations, whereas PX-DA does. The PX-DA algorithm we 
propose involves movement of the augmented data in the extra simulation step with 
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a combination of translation and scaling. We also implement an efficient Metropolis- 
Hastings kernel with independent proposals when sampling the augmented data. 

As an illustrative example of learning a human controller we apply our frame- 
work to the game of Tetris. Automating Tetris is a challenging benchm ark problem 
in the control literature, see for example iBertsekas and Tsitsiklisl (Il996lh and treat- 



ing it is difficult because the control model has a very large state space. Moreover, 
data from a human player is noisy, as we are prone to making errors. We show 
the proposed method can quite accurately mimic a given human player by per- 
forming posterior prediction from a limited amount of observed data from that 
player. By contrast, existing approaches from the control literature focus solely on, 
having specified the reward function, solving the associated difficult optimization 
problem using re i nforcement learning or o t her dynamic programming alg o rithm s 



pro Diem using rcmiorccmcnt learning or otner dynamic programming aigoritnm 
(jBertsekasl 120051 120071 : iTsitsiklis and Rovl . 1 19941 : IBertsekas and Tsitsiklisi . Il996h 



We also demonstrate the effect of the amount of data available from a player on 
posterior distributions which, under the proposed model, characterise their action 
preferences. 

1.3. Plan, notation. The organization of this paper is as follows. Section [5] de- 
fines the problem in detail and states the inference objectives. Section [3] describes 
the PX-DA method generally and then the specific implementation for the model 
we consider. Section |4] presents the PX-DA sampler in detail for the assumed priors 
and discusses some practical issues and extensions. Numerical results highlighting 
various properties of the proposed PX-DA algorithm are presented in Section [SJ 
as well as implementation details and results for Tetris. A proof of the correct- 
ness of the proposed PX-DA algorithm is presented in the Appendix along with 
implementation details of the MCMC algorithm. 

This section is concluded with a description of the notation used. Capital letters 
are used for random variables and lower case for their realizations. We use the colon 
short-hand for sequence of random variables, e.g. X^-k = (Xq, . . . , Xk). The letters 
/ and p are reserved for the probability densities or probability mass functions of 
random variables. For two jointly distributed random variables (X, Y), fx\Y, fx,Y 
and fx denote, respectively, the conditional probability density, the joint density 
and the marginal density. When the subscript is omitted, the arguments of / or p 
will indicate precisely the random variables to which the density corresponds. For 
example, p[x\y) is Px\y{ x \v)- The value at x of the multivariate normal probability 
density with mean /j, and covariance £ is denoted 7V(x; fi, £). For a vector v, the 
i-th component is denoted v(i). All vectors are column vectors and the transpose 
of v is indicated by v T . The m-dimensional vector comprised of ones (respectively 
zeros) only is denoted by l m (respectively m ). The subscript m is omitted when 
the dimension is obvious from context. I m will denote the to by to identity matrix. 
Similarly, pjij will denote the (i, j)-th element of the matrix S. ILa is the indicator 
function of the set A, i.e. Ia(x) = 1 if x € j4 and otherwise. K denotes the real 
line, R+ its strictly positive part and E is the mathematical expectation operator. 
The cardinality of a finite set A is denoted by The Dirac measure concentrated 
at a point x is denoted by S x . 

2. Problem Statement 



2.1. Markov decision processes. An MDP is comprised of a controlled Markov 
chain, a control process, a reward function and an optimality criterion. Each of 
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these are defined in turn below; see iBertsekad ( 20051 2007 ) for additional back- 
ground and details of other MDP optimality criteria. 

The state process, denoted {X t } t >i, is a X- valued controlled discrete time (so 
t is always an integer) Markov chain where X is the finite set {1,2,..., N}. Let 
{A t }t>i be the .4-valued control (or action) process where A = {1, 2, . . . , M} is the 
set of all possible controls. Given the entire realization of the state and actions up 
to time t > 1, the evolution of the state to time t + 1 is determined by the selected 
action and state at time t only, i.e. 



(1) 



X t+ i\ (X 1:t = xut, Ai; t = a\: t ) ~p(-\x t ,a t ), 



where for each state-action pair (x,a), p(-\x,a) is a probability distribution on X. 
The evolution of the action process is determined by a policy /i which is a mapping 
from the set of states to the set of actions. Particularly, for k > 1 

■Ail = Xl; t , Al; t _l = dQ ■ t - 1 ) ~ ^( It )(| 

Let r be a real valued function on X which is called the reward function. The 
reward at time t for being in state X t is r(X t ). We consider the following standard 
optimality criterion: a discounted sum of accumulated rewards over an infinite 
horizon, 



(2) 



Xi = x\ 



where j3 € (0, 1) is the discount factor ensuring the expectation is well defined. (If 
there exists a zero reward state which is absorbing, and all policies lead to this state 
with probability one for all initial states X\ then the expectation is well defined, 
provided A" is a finite set, without the discount /3.) The subscript on the expectation 
operator denotes the policy controlling the evolution of {X t }t>i- A policy fi* is 
said to be optimal if C^ixi) > C^{x\) for all (/x, x±). 

It is well known that fi* is characterized by the real valued function on X 7 denoted 
V, which satisfies the following fixed point equation, 



(3) 



V{x) = max < r(x) + £ p(x'\x, a)V{x') \ 
ae I x'ex ) 



In the literature on MDPs (Bertsekas and Tsitsiklisl Il996l ; iBertsekasl l2005l 120071 ). 
V is referred to as the (optimal) value function. Since V is a X — > R function, it 
will be treated as a vector in R^ from now on. Given V, the optimal policy fi* is, 
for all x € X, 



(4) 



(x) = argmax< 7 p(x \x,a)V(x ) > = argmax {(R t V) (a)} 

aeA £ — ' aeA 

Kx'ex ) 



where Rt is a short-hand for R Xt , and, for each x 6 X, R x is the M x N transition 
probability matrix with elements 



(5) 



[Rx]i,j = P(j\x, i), l<i<M, 1<j<N, 



with p(j\x,i) defined as in ([T]). Recall that in our notations (RtV)(a) is the a-th 
component of vector RtV. 
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2.2. A statistical model for imperfect policy execution. We consider the 
following statistical model for the action component at of each observed state-action 
pair (xt,at) built around the MDP framework. It is assumed that 

(6) A t = argmax{e t (a) + (i? t V)(a)} 

where Rt has been defined at the end of the previous section, and the et's, t > 1, 
are independent and identically distributed M-dimensional Gaussian variates, 

e T = (e t (l),...,e t (M))~^(0 M ,E). 

(The choice of the Gaussian distribution is for computational and inferential con- 
venience.) The inclusion of this noise process renders the model more versatile. It 
may be interpreted in two different ways. First, if there are several actions that are 
near optimal, in the sense quantified by the numerical value of the expression in the 
right hand side of (j4} , then the controller could have selected one of the near opti- 
mal actions in error. Thus while the policy is optimal, the execution of the policy is 
subject to disturbance. Second, it can be shown that © characterizes the optimal 
policy of an MDP with a mixed discrete-continuous state process, (x, e) g X x R , 
and reward function r : X x R M x A — > R given by r(x, e, a) = r(x) + e(a). Given the 
state (xt, et) at time t, and action a t , the discrete component of the next state, X t +i, 
is drawn from (JTJ) while the continuous component e t+ i is drawn from A/"(0m,5]). 
It follows from this separation in the evolution of the state components that there 



exists a vector V <E R'*' such the optimal policy for this MDP is given by ([6]) (jRust 



1988L Theorems 3.1, 3.3). In this model, the statistician only observes the discrete 
component of the state process and the action taken at each time, while e t is the 
unobserved random component of the reward known only to the decision maker. 

With respect to the interpretation of the model, note that, if (3 and V are fixed, 
then the reward function is entirely determined by (|3]). Thus, when inferring V 
from the model defined by ([I]) and ([6]), (and fixing /3), one is also implicitly infer- 
ring the optimality criterion (or equivalently the reward function) that governed 
the controller's behaviour, and that criterion is presumably unknown to the ob- 
server, prior to collecting data. In practical terms, this also means that it remains 
reasonable to apply this model even when the controller's actions seem inefficient 
or even erratic to the observer, as the observer and the controller may simply have 
very different policy criteria. 

2.3. Inference objectives. The data d consist of a sequence of state-action pairs, 
d = {d t }J =1 = {{xt,at)}J = i observed for T epochs and the aim is to infer V. It is 
assumed that the law of the controlled process, which is specified by the collection of 
transition matrices {Pa} a£ ^ is known, but the reward function r is unknown. This 
implies dU cannot be used to solve for V. The approach below can be generalized 
to the case when {P a } a&J { is unknown. However, assuming {P a } a< z_A is known is 
reasonable in a number of applications, in particular the human controller example 
studied in Section [5j 

In Bayesian setting, a prior for V is chosen and inference will be based on samples 
from the posterior Pv\D( v \d), henceforth denoted as p(v\d). (The specification of 
the prior over the optimal value function is postponed to Section |4j) These samples 
may then also be used via (j4|) to estimate the optimal policy /i* and thus predict the 
behavior of the system. In the context of the human controller example, d consists 
of the observed actions of a person. 
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The likelihood of the observed data is 

T 



p(d\v,E) = Y\_p(xt\x t -i, a t -i)p(a t \v,E,x t ) oc _Qp(a t \v,T,,x t ) 



t=i 



where, abusing notation, p{x\ \xq, ao) denotes the prior distribution for X\. The 
terms p{xt\xt-i, at-i) niay be omitted as they have no bearing on the desired 
posterior. 

The likelihood p(at \v, E, Xt ), or conditional choice probability, is the intractable 
integral 



(7) p(A t = a t \v,H,x t ) = / Af(u;R t v,E)du. 

J{ueR M :u(a t )>u(j) for all j^a t } 

Henceforth p(A t = a t \v, S, Xt ) will be abbreviated to p(at \v, £). The likelihood is 
invariant to both translations of the vector v and multiplications of it by positive 
scalars, 

(8) p(d\v,E) = p(d\ y /zJ(v + z 2 l),z 1 'E), V(zi,z 2 ) G K+ x R. 

The design of the PX-DA algorithm presented in the following Section is based on 
this property. 

The assumed model for the noise corrupting the action selection process re- 

sults in a target distribution similar to the multinomial probit (MNP ) problem 

( Albert and Chib!.ll993tlGeweke et al.lJl994t[McCulloch and Rosslll994tlMcCulloch etal 



2000; Ima i and van and the stated invariance of the likelihood to scal- 



ing (zi) is well documented in this literature. Specifically, in ([5]), the observed 
actions A t correspond to the observed chosen outcomes (among M alternatives), 
the random terms e t (a) may be interpreted as the non-observed part of the utility 
function, and the observed part of the utility function (R t V)(a) may be interpreted 
as a linear combination of covariates, where the covariates are the probabilities 
p(xt+i — j\xt,a t ), j = 1,...,N, and the unknown regression coefficients are the 
components of V. 

In the context of MNP models, the main existing approach to ensure the posterior 
is well defined for improper priors is to constrain enough parameters of the model 
to ensure identifiability of the remaining ones and then introduce priors for them. 
For example, by setting the last component of V to zero and then introducing a 
suitable prior for the remaining N — 1 non-zero components. We shall use a different 
approach as detailed in Section [4] 



3. The PX-DA Method 

Let /x : t p -> 1 be a target probability density from which samples are sought. 
In many applications, it is not possible to simulate from fx directly. However, it is 
often possible to introduce a random vector Y £ R 9 which is jointly distributed with 
X such that sampling from the conditional densities fx\ y and fy \ x is stra i ghtfor - 
ward. This is the principle of data augmentation (DA) ([Tanner and Wonj . Il987l) . 
Simulating from these densities sequentially as follows, 



(9) Y n+1 \X n = x n ~ fy\x{-\x n ), X n+1 \Y n+1 = y n+1 - fx\Y^\Vn+i), n>0, 
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results in a Markov chain {X n } n >Q with the correct a symptot i c dist ribution for any 
initial state Xo (under weak regularity assumptions) (jHoberti . 120111 ): 



(10) 



lim V(X n G A) 



fx{x)dx. 



As noted bv lLiu and Wul ( 1999f ). lMeng and van Dvkl (jl999l ) in some situations it is 
possible to improve the efficiency of this sampler by introduci ng auxiliary variabl es. 
This technique was termed parameter expanded (PX) DA bv lLiu and Wu (1999). 

Let A C M d and let {<pa}agA be a class of one-to-one differentiable functions mapping 
R q to itself. Let 



(11) 



My') = 



det 



9<p\,i(y) 



dy(j) 



where (px,i(y) is z-th component function of <p\(y). J\ is the Jacobian determinant 
of the mapping <p\ : M. q — > R 9 . Let Z be a random vector in A C R d with 
probability density fz- The aim is to reduce the auto-correlation between X n 
and X n+ i generated by the Gibbs sampler and PX-DA achieves this by inserting 
an extra simulation step as follows. 



A generic PX-DA sampler 



Given X n — x n at iteration n + 1, perform the following steps to sample X n+ \: 
Step 1. Sample Y n +\ from fyixi'l^n) and call the sampled value y n +i- (If exact 

sampling from fy\x is not possible, sample Y n +i from a Markov kernel that leaves 

fY\x{-\x n ) invariant.) 

Step 2a. Sample Z^_ t from fz(-), call the sample and let y n +i = 

(j/n+l)- 

Z n + 1 

(2) 

Step 2b. Sample another A- valued random variable, ^„_f_i' f rom the density 
which is defined (upto a proportionality constant) by 

(12) fY(tpz{yn+i))Jz(y n +i)fz(z)- 

Call the result z^L and set y' n+1 = <p <2> (y n +\). 
Step 3. Sample X n+X from fx\Y(-\y' n +i) 

The difference between the standard DA algorithm in and PX-DA is step 
2. Per iteration, PX-DA has a slightly greater computational cost due to the need 
to sample the variables (Z„\ Z^). However, in cases of practical interest, these 
variables are typically of a much lower dimension than X or Y and the increase in 
computational cost is often negligible. The benefit though, in terms of the mixing 
rate of the sampler , has been observed to be quite substantial in some situations 
( Liu and Wul . Q.999). Direct simulation from the probability density on A given 
by (|12p is possible for the specific family of mappings {^a}agA we consider in the 
sequel. When a direct draw from (fT2|) is possible the resulting PX-DA algorithm is 
termed exact. 
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Step 2 transforms the simulated random variable in step 1 from y n +i to y' n+ i via 
the intermediate value y n +i- Essentially step 2 is implementing a Markov transition 
from M. q to M. q using the kernel 



Q(y n+1 ,B)=E{l B (Y^ +1 )\Y n+1 =y n+1 }=E\l B ((p 



<P Z M 



<Y n +i))\Y n 



It can be shown that Q is reversible with resp ect to the marginal distribution fy of 
Y, and thus fy is also invariant for Q(y n , B) (|Liu and Wul . ll999l Theorem 1). This 
in turn implies that the invariant probability density of the Markov chain {A n } n >o 
generated by the PX-DA algorithm is indeed fx', if X n ~ fx then the law of Y n +i is 
fy and, since fy is invariant for Q, the l aw of Y^ +1 is also fy . 

As was noted by Liu and Wu ( 19991 ). Meng and van Dvkl (1999), it is possible 
to reduce the auto-correlation between the successive X n samples generated by the 
PX-DA algorithm by making the prior fz more diffuse. In fact, with a trivial 
modification, the PX-DA algorithm can still remain valid as an MCMC scheme 
when the prior is improper. The random draw in step 2a is then no longer well 
defined, but as we shall now see in the context of a specfic transformation, the 
correct procedure in this case is to omit this draw and set y n +i to y n +i from step 
1. All other steps remain unchanged. 

Let A = R + x M and 

(13) <p z (y) = — - z 2 l, z = (z u z 2 ) e K+ x R. 



z\ 



A result concerning the correctness of the PX-DA method when fz is improper is 
now stated. Although this has been established explicitly in the cas e of either scal- 



ing or translation only (|Liu and Wul . Il999t iMeng and van Dvk . 119991) . the extension 



to the present setting is not difficult. (See also Proposition 3 of 
(|2008l) .l 



Hobert and Marchev 



Proposition 1. Consider the transformation in US\) and suppose that 
c(y) f R xR fy(if z (y))J z (y)dzidz2, y G M 9 is positive and finite almost every- 
where. Then the Markov transition density on W defined by 



Q(y,B)= f 



fY{<p z {y))Jz{y) 
c(y) 



dz\dz2 



is reversible with respect to fy 



(Proof is in the Appendix.) 

For any h : MP — > K. which is square-integrable with respect to fx, i.e. / h 2 (x)fx(x)dx < 
oo, if a central limit theorem holds, then 



(14) 

where 
(15) 



L"£h(X n )AAf(E fjs 

n—1 



(h(X)),a 2 (h)) 



OO 



<T*(h) = co(h)+2\ci(h), c i (h)=E fx (h(X i )h(X ))-E fx (h(X )y, i>0. 



The convergence in (|14[) is in distribution and the expectations in the expression 
for cr 2 (h) are computed with respect to the law of the Markov chain {A„}„>o with 
initial distribution fx- 
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Hobert and Marchevl (|2008l ) studied the relative performance of DA and PX-DA 
algorithms, addressing the case of Haar PX-DA - a class of algorithms involving 
transformations derived from a particular group structure. The full details of Haar 
PX-DA are beyond the scope of this article, but we note, for example, that if Algo- 
rithm [T] is modified to perform only a scaling transformation and not translation. 



then i t is an instance of Haar PX-DA (the reader is also directed to iLiu and Wu 



(1999) "or details of the group structure underlying scaling and translations 



The following inequality for the asymptoti c variance of DA, PX - DA ( for any 
proper prior for Z) and Haar PX-DA is due to lHobert and Marchev ( 20081 ). 



er 2 (/i) > al(h) > o%(h) 

where the subscripts indicate the algorithm generating {X n } n >o; the standard DA 
is without subscript, the subscript P denotes PX-DA with a proper prior on Z and 
H denotes Haar PX-DA. Haar PX-DA is said to be the most efficient since it has 
a smallest asymptotic variance as measured by ([14)) . It should also be noted that 
IRovI (|2012f ) has recently established that a general "sandwich" data augmentation 
algorithm always converges as fast as its standard DA counterpart. Subject to the 
transition kernel Q in Proposition [T] being well defined, it is immediate that our 
algorithm using that Q is a sandwich data augmentation algorithm, and thus enjoys 
this ordering property. 

We stress that this variance inequality has only been shown to hold when fyix 
can be sampled from exactly in step 1 of Algorithm^ In our numerical experiments, 
this step is performed using a Metropolis-Hastings kernel, but as we shall see, 
empirical results suggest that with this modification the PX-DA algorithm still 
out-performs standard DA. 



4. A PX-DA SAMPLER FOR THE MDP MODEL 

The transformation of the augmented data will be as in (p~3|) . This section 
completes the description by specifying the prior for the optimal value function, 
the auxiliary variable Z and culminates with a statement of the complete sampling 
algorithm for these specific choices. Extensions to a more general reward function 
and the practicality of the approach for large problem sizes, specifically large X, 
are discussed at the end of the section. 

Regarding the prior for V , the following requirements seem reasonable: (a) it 
should respect the symmetry of the model regarding the N states; specifically, it 
should be invariant with respect to permutation of the state labels; (b) it should be 
conjugate, so that Gibbs steps can be implemented; and (c) to ease interpretation 
of the output, it should make the model identifiable. These requirements are met 
by the following prior distribution: a Gaussian N{0n, kIn) distribution (where k 
is a fixed hyper-parameter), but conditional on the event X}j=i ^(*) = 0- This 
prior distribution may be alternately described as follows: take U ~ Af(0jsr, kIn), 
then set V = U — A _1 ljvl^t/, that is, remove the mean of the U(i) to force the 
components of V to sum to zero. 

The constraint J^ i=1 V(i) = addresses the additive unidentifiability of the 
model, i.e. the fact that the likelihood is unchanged if the same constant is added to 
all the V(i). To fix multiplicative unidentifiability, i.e. the likelihood is unchanged if 
both V and £ are multiplied by the same constant, we take £ = I for the remainder 
of this Section. This choice presents an important advantage: it makes it possible 
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to implement Step 1 of Algorithm 1 using an efficient Metropolis-Hastings step, as 
described below. In Section |4~T1 we explain briefly how to consider a more general 
matrix S, and why we believe that £ = I should be sufficient in many practical 
(MDP) applications. 

We note in passing a differ ent way to treat additive un identifiability inspired 
by multivariate probit models (|McCulloch and RossiL 11994 : i.e. set one of the N 



components of the value function to zero, e.g. V(N) — 0. In our context however, 
this would suppress the symmetry between the N states, complicate the notations, 
and bring no obvious benefit. Also, additive unidentifiability can be exploited to 
yield a better PX-DA sampler. 
The augmented data is 

T T 

Y = (Wi,...,W T ), with p(u>i, . . . ,w t |v) = JJp(^i| u ) = \\^{ w t; Ri v , hi), 

i=l i=l 

which may be viewed as arising from "disintegration" of ©. Indeed, the PX-DA 
algorithm defined below will be derived from the joint density (fx,y(x, y) in section 
[3j with x = v and y = wi, . . . , wt)- 

p(v,wi, . . . ,wr\d) 
(16) 

T 

oc JV (v; Ojv_i, k/jv_i - kN~ 1 1 N - 1 iJ i _ 1 ) JJl{ WieK M. w . {a . ) > Wi{j) ^^ a ^U{w l ] RiV, I M ), 

i=l 

with the slight abuse of notation that the vector v in J\f(u>i; RiV, Im) is N dimen- 
sional where 7V-th component is 

JV-l 

v(N) = - J2 «(0. 
i=i 

(This convention will hold for the remainder of this section wherever RiV occurs.) 
The density (|16p clearly admits the posterior over v as a marginal. Direct simula- 
tion from p(wi, wt\v, d) is difficult in general, due to the presence of truncated 
Gaussian distributions. This is where our Metropolis-Hastings step will come in; 
further discussion is postponed until the end this section. Putting aside this dif- 
ficulty for now, we next describe a PX-DA algorithm for this model, i.e. derived 
from the standard DA algorithm which iteratively samples from the conditionals 
p(v\w\, wt, d) and p(w\, wt\v, d). 

The transformation of the augmented data for the PX-DA scheme is given in 
(H3|). We set fz u z 2 (zi,z 2 ) = fz 2 (z2)fz 1 (zi) and 



(17) Zi~/G(a,6), Z 2 ~Af{0,K/N), 

where IG is the inverse Gamma density. We stress that here k is the same parameter 
as appearing in the prior distribution over U, specified earlier in this section. It 
is this structure which allows us to construct a PX-DA algorithm incorporating a 
translation move. 

To clarify the connection with the description of the generic PX-DA sampler in 
section O with a slight abuse of the definition of (p^ 1 , 

Y> = y-^Y) = (^(Wf + z 2 ll), . . . , + z 2 l T M )) T , 
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and the Jacobian in ([XT]) is 

Jz(y') = z l 2 . 

With this choice of transformation of the variables, step 1 and 2a of the generic 
PX-DA algorithm [I] can be combined into step 1 of algorithm [2] below. Similarly, 
step 2b and 3 of algorithm [T] may be combined into step 2 of algorithm [2] 

The Metropolis Hastings kernel (with independent proposals) for step 1 of Algo- 
rithm [5] presented in the Appendix is quite efficient with acceptance rates typically 
around 70 percent for the numerical examples in Section [5] Step 2 can be imple- 
mented as detailed in Section 17.31 When improper priors are used for V, Z\ and 
Z 2 , the corresponding terms in ([20]) should be omitted. As discussed in Section |3l 
when improper priors are used for Z\ and Z 2 , these variables should not be sampled 
in step 1 above. However, one should be careful that (fT6| is still well defined when 
k = oo otherwise k should always be set to a finite value. (For instance, if the 
observed process is constant, then k = oo gives an improper posterior. However, 
we have been able to establish that k = oo can give a proper posterior under quite 
general conditions; details may be obtained from the authors.) 

4.1. Extensions. 

4.1.1. Action dependent Rewards. In Section [5] it was assumed that the reward 
function is not action dependent. The following extension to the criterion in @ 
can be considered. Replace r(X t ) in ([2]) by 

(18) r(X t ,A t ) = n(X t ) + r 2 (A t ). 



PX-DA for the MDP model 



Let Wi-t and v be the samples after iteration n. At iteration n + 1, perform the 
following two steps. 

Step 1: Sample Z\ ~ IG(a, b), call the result zi, sample Z 2 ~ A/"(0, k/N) and let 
z 2 denote this sampled value. For each i = 1, T, sample Wi from the truncated 
Gaussian 

(19) I{«; ( eR M : w i (a i )>w i (j),ji£a i }-N'(Wi;R4V, hi), 

call the result Wi and set w[ = ^pz{ (wi + Z21m)- (This step can be achieved directly 
or using the Metropolis-Hastings kernel detailed in Section [7T2l ) 

Step 2: Sample (V(l), V(N - 1), Z 2 , Zi) from the joint density 

M (v; Ojv-i,k//v-i ~ kA" 1 1 w _i1^_ 1 ) J^A/" ( - ziXm'-, RiV, Im ) 

i=i W z i / 

MT 

(20) x N(z 2 -Q,k/N)z 1 2 IG(zi;a,b) 

and z\ ' w[ — z 2 1m, i = 1,- ■ ■ ,T, are now the final W\-t for iteration n + 1. 

Note that V for this new problem still satisfies (j3]) with the reward function 
therein replaced by (|18|) . In this case the action generation model is now 

A t = arg max {e t (a) + r 2 (a) + (3 (R t V) (a)} 
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and 

p(A t = i\v,r 2 ,'S,Xt) = / Af{u;r 2 + R t v,T,)du. 

J{u£R M :u(i)>u(j)foT all j^i} 

It can be verified that, for all (zi, z%, Z3) € M+ xtxE, 

p(A t = i \\fz\(y + z 2 l), y/zi(r 2 + z 3 l),ziY,,x t ) = p(A t = i \v, r 2 , E, x t ). 

The prior for V could be the same as before (see Section [3} and one could also use 
a prior with the same structure for r 2 . 

4.1.2. Large State-spaces. Since V is a vector of length \X\, the approach detailed 
thus far will be impractical for a very large state-space X. In this setting we 
may regress the optimal v alue function onto a set of basis functions. (A s imilar 
approach was proposed by iGeweke and Keand (|l996l ). iGeweke and Keand (|2000l) 



for a finite horizon dynamic discrete choice problem and the idea goes back some 



way in the control literature, see for example iSchweitzer and Seidmannl ([1985)) Let 



{4>i\i<i<K be a collection of basis functions, mapping X to the real line. Typi- 
cally K is much smaller than \X\. It is assumed that the conditional expectation, 
Y^x'eX c t ) i( x ')p( x '\ x i a )i can be computed easily for each state-action pair (x, a) and 
i. For example, this would be true if p(x'\x, a) is non-zero for only a handful of 
values of x', see the human controller example considered in Section The action 
generation model is (for an action independent reward), 



argmax I e t (a) + V* V(i) (R t 4>i) (a) > , 
aeA I 7=1 J 



and the corresponding likelihood satisfies 

p(A t = i \y/i~iv, ZiT,, x t ) = p(A t = i \v, E, x t ). 

The likelihood is no longer invariant to scalar translations of the value function. 
As the model is no longer additively unidentifiable, an unconstrained prior may 
thus be defined over all N components of V. For example, the prior N(On, kIn) is 
admissible even as k — > 00. The PX-DA implementation for this model will involve 
transforming the augmented data by a scalar multiplication only. 

4.1.3. Constrained Actions. In some applications, state dependent action constraints 
are present, i.e. not every action in A is permitted in every state. The modification 
to Algorithm [2] is trivial. For example, if action j is not permitted in state Xi, then 
row j of the R Xi defined in (JSJ) is deleted. Action constraints are present in the 
example studied in Section [S] 

4.1.4. Non-identity Noise Covariance Matrix. We think that restricting the model 
to an identity covariance matrix for the noise term is very reasonable for the fol- 
lowing reasons. First, in the MDP context, one is mostly interested in inferring V 
as S is merely a nuisance parameter. Second, since only one action is observed at 
a time, it seems hard to estimate correlations between the different components of 
the noise vector. Third, considering a general S means that the dimension of the 
parameter space becomes 0(M 2 ), and the computational burden 0(M 3 ), as op- 
posed to O(M) for both quantities in the E = J case. (The computational burden 
increases also because of the greater difficulty to sample the latent variables Wi, as 
explained below.) This is clearly impractical when M is large. 
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However, for the sake of completeness, we now explain how to account for a 
general covariance matrix E. The prior suggested in llmai and van Dvkl (|2005h may 



adapted to the present setting. A prior for the covariance matrix E subject to 
the constraint [EJ^i = 1 is constructed by normalizing the samples from an inverse 
Wishart distribution. Specifically, E - IW(i/, S) and E = E/[E]i,i. Let z Y = 
then, 

p(z 1; E) oc lEl-W^exp f-£-tr(S^)\ (z^^ 



where constant a 2 satisfies S — a 2 S. The conditional density 

p( 2l |E)=ZG^,^tr(S'E- 1 ; 

is now the new distribution for the scaling parameter in the PX-DA transformation 
of the augmented data; see (TIT)) . To infer E as well, Algorithm [2] would be modified 
to sample Wi-.t, V and then E in turn. For a non-diagonal covariance matrix, step 
1 cannot be implemented with the Metropolis-Hastings kernel described in Section 
17.21 A possible alternative is to use a Gibbs sampling step where, for each i, each 
component of Wi is sampled conditioned on the remaining components. Once a 
complete cycle has been performed, then the transformation at the end of step 
1 can be applied. Step 2 will be modified to sample (Z\, Z2, V, E) conditioned on 
wi-.t, which can be performed by an appropriate blocking scheme after the change of 
variable in (|30[) . Roughly speaking, (Z\, Z2, V) is sampled conditioned on (E, w\ : t) 
and then (Zi,E) conditioned on (Z2, V, u>i : t)- The samples produced may suffer 
from much more correlation than in the case of Algorithm [2] which is catered to 
E = I. 



5. Numerical Examples 

5.1. Toy Example. To demonstrate the performance improvements of PX-DA 
over standard DA, a data record of 20 state-action pairs was generated from the 
model with 7 states, 3 actions. The true optimal value function was drawn from 
the prior. Algorithm [2] was run for 5 x 10 5 iterations and half were discarded for 
burn in. The parameters of the priors in (I17|) were chosen a = b = 1, k = 2500. 
Figure Q] shows the empirical auto-correlation of the MCMC output for some of 
the components of the estimated optimal value function. The improvements due to 
scaling and translation of the augmented data are isolated. For the components of 
the value function not shown, the improvements were comparable. The acceptance 
rate for the Metropolis-Hastings kernel used to implement step 1 of Algorithm [5] 
was in excess of 95%. 

Figure [2] isolates the effect of an improper prior in PX-DA (i.e. a = b = 0, 
k = 00). This study is restricted to PX-DA that scales the augmented data only 
since the prior for the value function in (|16j) also depends on the parameter k 
that controls the variance of the law of the translation parameter (| 1 7[) . Figure [2] 
shows the computed autocorrelation for components 4 and 7 of the estimated value 
function. For the proper prior, a = 5, b — 0.5. In this case the improper prior for 
the scaling parameter yields a modest improvement in performance over the proper 
prior. (Note though there no longer the issue of tuning the prior for the scaling 
parameter.) 
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Figure 1 . Post-burn in autocorrelation plots of posterior samples 
for components (left) 4 and (right) 7 of the value function. Solid 
line is standard DA, dash-dot is PX-DA with scale move only, 
dashed line is PX-DA with scale and translation move. 




10 20 30 40 50 10 20 30 40 50 



lag lag 

Figure 2. PX-DA post-burn in autocorrelation plots of posterior 
samples for component 4 (left) and 7 (right) of the value function. 
Dash-dot line is PX-DA with scale move only and a proper prior 
with a = 5, b = 0.5. Solid line is PX-DA with scale move only and 
an improper prior. 

The final experiment demonstrates a situation where PX-DA converges but DA 
does not. The data comprising of 20 state-action pairs of the previous examples is 
extended to 50 by appending 30 more. In this example, the priors for V, Z\ and 
Zi are improper. The posterior mean of the value function calculated with PX- 
DA with scaling and translation is [2.34, -0.92, -1.02, 4.12, 5.69, -1.79, -8.41] T . 
(1.5 x 10 6 posterior samples but half discarded for burn in. The posterior mean 
for PX-DA with scaling or translation only was practically the same.) The PX-DA 
implementations were initialized with the value function set to 100 x I7. Shown 
in Figure [3] is the trace plot of the samples of component 7 of the value function 
obtained using the DA method initialized with the value function set to 10 x I7. 
The mean of the second half of the samples in Figure |3] is —3.56. (In fact all other 
components of the mean of the posterior value function calculated with DA are 
quite far out.) In this case we see that DA fails to converge even though initialized 
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far closer to the true values than PX-DA. Finally, to isolate the improvements 
due to scaling and translation, the autocorrelation plots of certain components of 
the posterior samples of the value function are compared in Figure [3] for PX-DA 
implemented with both additive and scaling, scaling only and additive only. In this 
example, the translation move appears more beneficial than scaling. 




0.2 0.4 0.6 0.8 1 

sample index *io 6 



Figure 3. From left to right: PX-DA post-burn in autocorrela- 
tion plots of posterior samples for component 4 and 7 of the value 
function; and DA trace plot of posterior samples of component 7 of 
the value function, lack of stationarity is apparent. For the auto- 
correlation plots, solid line is PX-DA with translation move only, 
dash-dot is PX-DA with scale move only, dashed line is PX-DA 
with scale and translation move. The mean of the second half of 
the samples from DA is —3.56 whereas the true posterior mean 
(calculated with PX-DA for which the three implementations are 
in agreement) is —8.4.. It appears the DA algorithm has not con- 
verged. 

5.2. Application to Human Controller Learning. In this section we apply the 
proposed method to an MDP which arises in the context of the popular computer 
game Tetris. In this game the player controls the positions and orientations of 
random two-dimensional shapes, henceforth the blocks, which arrive over time and 
occupy a field of play, henceforth the board, in a non-overlapping manner. 

5.2.1. Model Definition. In the MDP formulation of Tetris, the state X = (C,tj) 
consists of two components. The first component, £, is the current configuration of 
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Figure 4. The seven blocks of Tetris. 



FIGURE 5. Example iteration of Tetris. From left to right: 1) A 
block appears at the top of the board. 2) An action is taken to 
rotate and translate the block. 3) The block then falls until it 
reaches occupied squares. 4) Fully occupied rows of the board are 
removed. 

the board and is expressed as a 30 x 10 binary matrix. The second component, 77, 
is the index of a block. We consider 7 distinct blocks, shown in Figure 0] and thus 
i] takes values in {1, 2, 7}. Each action A consists of the angle through which to 
rotate the current block (0°, 90°, 180°, 270°), and the number of squares by which 
to move it left or right. For each state not all combinations of horizontal translation 
and rotation are necessarily permitted as the block must remain entirely within the 
boundaries of the board and must not overlap with any occupied squares. We write 
A(x) for the set of actions which are valid in state X = x. 

From the current state Xt = ((t,ilt) and an action At <E A(Xt), the evolution of 
the state occurs according to 

(21) ( t+1 =Tjj(( t ,r] U A t ), »h + i~W(l,2,...,7), 

where ip is a deterministic mapping which describes the evolution of the board 
configuration once the action has been chosen. For a configuration Qt with no 
occupied squares in the top row, ip yields the new configuration £t+i by moving 
the block r\ t according to A t , then allowing the block to "fall" until it reaches 
an occupied square or the bottom row of the board, and then removing any fully 
occupied rows. For a configuration £ t which has an occupied square in the top row, 
ip sets ( t +i = Ct irrespective of A t and i] t . The latter corresponds to "termination" 
of the game; once such a state is reached, subsequent actions do not influence the 
state. A pictorial representation of one iteration of the game is given in Figure [5J 
As each state consists of a single board configuration and block type, the total 
number of states in the Tetris model is rather large. We therefore adopt the ap- 
proach outlined in Section |4] and regress the value function on to a collection of 
K basis functions {(pi, <p2, — , <Pk}, which depend on the board configuration ( but 
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not on the randomly falling piece r\. (Note that the latter is not controlled as new 
blocks arrive independently of the action and the previous state.) Specific details 
of the basis functions are given in section 15.21 

We assume that the reward is independent of the action and the action generation 
model is then 



(22) 



A t = arg max \ e ( (a) + V^(i) • (0, o ip) (Cum, a) } . 



We assume that the noise corrupting the action choice has identity covariance. The 
likelihood of observed data is 

T 

p(d\v,Y,) = ^p(a t \v,x t ,T,), 
t=i 

where for each k = 1, T, 

p(A t =i\v,x t ,E) = / Af(e;R t v,I Mt )de. 

J {eeR M t : e(i)>e(j) for all jjti} 

Here M t := |.4(xt)| and Rt is the M t x N matrix with entries specified by 

[ R t]ij ■= (<f>j ^)((t,Vt,i)- 

In this case the likelihood is invariant to scaling in the sense that 

p(d\v,a 2 I) = p{d\\fz\v , z\o 2 1) , Vzi € R+. 

The sampling algorithm for inference is Algorithm [5] where the augmented data and 
transformation are given by 

Y = (W U ..., W T ), Y' = ^(Y) - (VSTWf, . . . , v^T^) T , 

where the scaling factor Z\ ~ IG(a, b). The Jacobian for this transformation is 

JM = *i 2 ■ 

The prior for V is M(On, kIn)- 

We consider the following K — 3 basis functions which were found to capture 
various features of the board configuration. 4>\ , the height of the top- most occupied 
square in the board, across all the columns; 02, the number of unoccupied squares 
which have at least one occupied square above them in the same column; </>3 the 
sum of the squared diffe r ences between occupied heights of adj acent columns. 



In iTsitsiklis and Rovl (|1994l ). iBertsekas and Tsitsiklisl (|1996f ). for a board with c 



columns, fa, 4>2 and 2c — 1 additional features were used to construct an automated 
self-improving Tetris playing system using Reinforcement Learning techniques. In 
contrast, the emphasis here is to make predictions about actions and mimic play on 
the basis of observed state-action data. In our setup the latter amounts to posterior 
prediction, which can be performed in the following manner. Let {V n }^ =1 be a 
collection of post-burn-in samples from the posterior distribution over the value 
function, obtained from the PX-DA algorithm. Then for each state in a given 
sequence {{Ct,i]t)}t=i we would like to make predictions under our model about 
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the corresponding action, on the basis of the posterior samples {V n }n=i- To this 
end, for each (Ct,Vt) we define the MAP predicted action as 

L 

A™ AP (( t , Vt ) := arg max V I f A,, fc (Ct, Vt) = a] , 

n—1 
\ 

(23) A n<k (C t ,r) t ) := arg max < e ntk (a) + V, V n (j) ■ (<pj o t/j) (£ t , r/ t , a) 

where for each 1 < n < £, 1 < fc < T and a € A{xt), e n ,fe(a) is an independent 
A/"(0, 1) random variable. 

In the following section the predictive performance of the model is assessed for 
a number of data sets. Each data set is divided into two subsets. The PX-DA 
algorithm is used to draw samples from the posterior corresponding to the first 
subset and then the accuracy of the posterior prediction is assessed using the second 
subset. This assessment is performed in terms of the empirical action error, defined 
as 

(24) £ a :=I^l[2MAP (Ct)7?t) ^ a ; 

t=i 

where {(Ct, i]t, <it)}t=i is the second data subset. 

Finally we note that in practical situations computation of (|23|) may be expensive 
if L is large, in which case one may resort to heuristic action prediction based on a 
posterior point estimate of V . We do not explore this issue further. 

5.2.2. Experiment 1. The aim of the first numerical experiment is to verify that it 
is possible to recover a value function and perform accurate prediction from data 
when the truth is known. We consider three different value functions (—3, —15, —1), 
(0,5,0) and (—20,0,1). These value functions were chosen for purposes of expo- 
sition; the corresponding optimal policies lead to qualitatively distinct styles of 
play. Snap-shots of typical board configurations under play according to the ac- 
tion generation model for each of these value functions are given in the top row 
of Figure H The first value function, (—3,-15,-1), led to an "efficient" style of 
play in which the upper region of the board is rarely occupied. The second value 
function, (0,5,0), yields a policy which encloses many unoccupied spaces, leading 
to the distinctive zig-zag pattern displayed in the second columns of Figure The 
third value function, (—20,0,1), corresponds to a policy which tends to produce 
"towers" of occupied squares. 

For each of the three value functions, 500 observations (state/action pairs) were 
generated according to the model (f22j) with the state updated according to (|2T|) . 
During generation of the data, if the game terminated it was immediately restarted. 
For the value function (—3,-15,-1), termination did not occur within 500 time 
steps of the game. For the other two, termination typically occurred after 10 to 
20 time steps so the full data record of length 500 consisted of the concatenation 
of several data sets. In all three cases, the first 100 observations were reserved for 
inference and the remaining 400 used for assessment of predictive performance. 

For each value function the PX-DA algorithm, incorporating the Metropolis- 
Hastings kernel, was run independently targeting the posterior distributions corre- 
sponding to the first 10, 20, 50 and 100 observations. In each case the algorithm was 
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run for 5 x 10 5 iterations, with a burn in of 10 4 iterations. The Metropolis-Hastings 
acceptance rate was found to be between 0.5 and 0.9 in all cases. The parameters 
of the model were set to k = 2500 to give a relatively uninformative prior over the 
value function, and for the prior on the parameter z%, a = 3 and b = 10 5 . For 
these tuned values of a and b, using an improper prior over z\ led to negligible im- 
provements in performance. Post-burn in trace plots, histograms and kernel density 
estimates are shown in Figure [6] along with the true value function values for the 
case of inference from 50 observations. In all cases, the posterior marginals have 
significant mass in the neighborhood of the true value function values. 

Figure [6] also shows the autocorrelation for one component of one of the value 
function, from the output of the PX-DA and standard DA algorithms. This indi- 
cates that the PX-DA algorithm yields a significantly lower autocorrelation than 
the standard DA scheme. 

Figure [7] shows the predictive performance in terms of the prediction error £ a 
defined in equation (|24[) as a function of the number of observations used for infer- 
ence. In all cases £ a was computed using the remaining 400 observations, i.e. in 
([24| T — 400. These results verify that the predictive performance improves as the 
number of observations used for inference increases. 

The qualitative characteristics of play according to the three true value functions 
and according to the posterior predictions are also summarized in Figure [7J In this 
Figure, the top row shows snap-shots of board configurations. The bottom row 
shows snap-shots of play according to posterior predicted actions (with inference 
based on 50 observations) for a different block sequence {rjk} and with the state 
updated according to 

Ct+i = V (Ct,VtJ^ AP (Ck,Vk)) , Vt+i ~W(1,2,...,7). 

These results indicate that the predicted actions result in a style of play which is 
qualitatively similar to that obtained from actions generated according to the true 
value function. 

Lastly, with moves played according to A^ AP based on inference from the 100 
observations generated using the value function (—3, —15, —1), termination of the 
game did not occur within 250 time steps in 80 out of 100 trials. In this sense, 
play according to A^ AP compared quite well with play according to A t , where no 
termination occurred within 500 time steps. 

5.2.3. Experiment 2. The aim of the second experiment is to demonstrate inference 
and prediction from a data set of a human player, i.e. in this case the true value 
function is unknown. The game was played for 500 iterations and again, the first 
100 observations were reserved for inference and the subsequent 400 observations 
were reserved for assessment of predictive performance. The PX-DA algorithm was 
run using the same settings as in Experiment 1. Again, the Metropolis-Hastings 
acceptance rate was found to be between 0.5 and 0.9. Trace plots, histograms and 
kernel density estimates are displayed in Figure [5] for the case of inference from 
50 observations. Figure [8] also shows the empirical action error as function of the 
number of observations used for inference. The result indicates that even with three 
basis functions, it is possible to capture significant information about the player's 
policy. 
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Figure 6. Experiment 1. PX-DA post burn-in trace plots of 
4x 10 5 samples, histograms and kernel density estimates of the pos- 
terior marginal distributions corresponding to 50 observations for 
the three value functions. Top left: value function (—3,-15,-1). 
Top right: (—20,0,1). Bottom left: (0,5,0). True values are shown 
with vertical lines. Bottom right: auto-correlation as a function of 
lag of the first component of V for PX-DA (dashed) and DA (solid) 
algorithms in the case of the true value function (— 3,-15,— 1) T , 
from 4 x 10 5 post burn-in samples. 



5.2.4. Experiment 3. In the third experiment, the play of a human was recorded 
under time-pressure: at each iteration of the game a fixed time was allocated for 
an action to be chosen. The aim of this experiment was to validate the statistical 
treatment of the problem, by exhibiting the influence of the amount of data recorded 
on inferences drawn about a player's action preferences. The experiment concerns 
a situation in which the amount of data recorded is driven by the speed at which 
the player is forced to play; upon the appearance of a block at the top of the board, 
the player was allowed r seconds to decide how to move the block. If this time limit 
was exceeded, no action was recorded and the board was updated by allowing the 
block to fall without any rotation or translation. 

Figure [9] shows histograms of post-burn- in MCMC samples approximating pos- 
terior marginals. Each panel corresponds to a different value of the time-pressure 
parameter r (see caption for details). For each of the four values of r, the player 
was presented with the same sequence of 100 blocks. The hyper-parameters were 
set to the same values as in Experiment 2. The results indicate that, as the time 
allocated for decision making was increased, the data provide more information 
about their action preferences, and this is manifested in the concentration of the 
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Figure 7. Experiment 1. Right top row: Board snap-shots. From 
left to right the true value functions arc (—3,-15,-1), (0,5,0) 
and (—20,0, 1). Right bottom row: Board snapshots during play 
according to posterior predictive actions for a different block se- 
quence. Left: Posterior prediction errors as a function of the num- 
ber of observations used for inference for the three value functions: 
(-3,-15,-1) solid, (0,5,0) dashed and (-20,0,1) dash-dot. 
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Figure 8. Left: trace plots, histograms and kernel density esti- 
mates for of posterior marginals for the three components of the 
value function. Inference based on 50 observations. Right: Poste- 
rior prediction error as a functions of the number of observations 
used for inference. 

posterior marginals. A striking feature is the difference between the top two panels, 
especially in terms of mode locations. These two panels correspond respectively to 
t = 10 and r = 5 seconds for decision making. The player made 28 more decisions 
within the allocated time in the former case than in the latter, evidently leading to 
differences in posterior distributions over components of the value function. 

6. Conclusion 

Our approach to inferential computation, based on an MCMC scheme, is well 
suited to the situation in which one is presented with a batch of state/action data. 
In some situations, it may be that data actually arrive gradually over time, in 
which case one is faced with the computational task of approximating a sequence 
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Figure 9. Experiment 3. Histogram approximations of poste- 
rior marginals from MCMC output of 250 x 10 3 samples. Each 
panel corresponds to a different constraint on the time r al- 
lowed per decision. Top to bottom: r = 10,5,3,1 seconds. 
The player was presented with 100 blocks in total and made re- 
spectively/Users/sumeetpalsingh/Desktop/untitled folder /refs. bib 
100, 72, 55, 17 decisions within the time constraints. White, grey 
and black shaded histograms correspond to the three components 
of the value function. 



of posterior dist ributions, defined as the data become available. Sequential Monte 



Carlo methods ([Chopinl . l2002t iDel Moral et all I2006T ) are amenable to this kind of 



sequential inference computations and a possible extension of the work presented 
here is to develo p such methods for the class of models we consider. Recently, 



Zhang and Singhl (|2012l ) have applied the same probabilistic model and PX-DA 



sampler developed in this paper to Microsoft's skill-based ranking model. The goal 
is to estimate the joint probability distribution of the skills of all players (where 
the skill of each player is represented by a real number) from the observation of the 
outcomes of multiple games involving subsets of these players. Preliminary results 
indicate that the PX-DA sampler is more accurate in predicting the outcome of 
games involving closely ranked players compared to Microsoft's variational Bayes 
approach (called TrueSkill.) This research is being developed further by the authors. 
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7. Appendix 



7.1. P roof of Proposition^ The proof is essentially that of lHobert and Marchev 



(2008, Proposition 3) specialized to our specific choice for the PX-DA move which 
comprises of both a scale and translation. 

An operation on R + x R is defined as follows. For any constants z = (Si, 22), z = 
(zi,z 2 ) el+xl, let 

zz := {z\Z\, z 2 + z^ 1 := (z^ X , -Z\Z?). 

As a consequence of these definitions, <p z (ip z (y)) = <p zz (y), cp z -i(ip z (y)) = ip z (ip z -i(y)) 
y and (p~ 1 (y) = (p z -i(y). The following equivalences may be established by rou- 
tine integration. For any z £ R + x M and integrable functions hi,h 2 : R 9 — >• M, 
g : M + xR^I, 

(25) / h 1 {<p z {y))J z {y)dy = [ h 1 {y)dy, 



(26) 
(27) 



g{zz)dz\dz 2 = — / g(z)dz\dz 2l 
zi 



c{<p z (y)) 



c{y) 
z\J z {y) ' 



(28) j g(z 1 1 , — Z1Z2) — dz\dz 2 = [ g(zi, z 2 )dz\dz 2 . 

JR + xR z 1 JE + xI 

(f25|) and (|26|) follow from the change of variable formula while (f27| follows from 
(|2"B1 and the fact that 

Jzz(y) 



Jz{v~z{y)) 



My) 



Then, 



hi(y)h 2 (ip z {y)) — fY{y)dy 



dz\dz 2 



^-H^iy))) 

hi{(fi z -i (y))h 2 (y) / Y V ) .. fy(<p z -i (y))dy dz x dz 2 



dz\dz 2 



+ XI 



c(^-i(y))' 

Mlz- 1 (y))h 2 (y) f Y ^ J * 1 ^ / y (y, ! (y))dzxdz 2 

z\c{y) 

hi(<p z (y))h 2 (y) h {V ] J * {V) h (<p z {y))dz x dz 2 ,h, 



dy 



where the final three lines are established by invoking (|25p . (|2T|) and (J25J). This 
establishes the stated reversibility. 
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7.2. Metropolis Hastings Kernel . For each i = 1, ...,T, one must sample Wi 
from the truncated Gaussian given in (|19j) . The procedure for performing this step 
is discussed below for Wi, with the subscript omitted from the notation. 

Let W(i) ~ Af(fi(i), 1) independently, i = 1,. . . ,M. The aim is to sample the 
scalar random variables W(i)'s conditional on the event W(i) < W(l), for a fixed I 
and alH ^ I. Without loss of generality, take 1 = 1. The corresponding distribution 
for the W(i)'s may be decomposed as follows. The marginal density of W(l) is 

M 

p(w(l)) cx Pu (w(l)) = Af(w(l);n(l), 1) H $(w(l) - M (i)), 

and, conditional on = io(l), W(«)|{W(1) = tu(l)} ~ TA/" ( _oo,«,(i)] (mW, 1) for 

i = 2, . . . , M, independently, where $ denotes the cumulative distribution function 
of Af(0, 1), and TN\ a w(m, s 2 ) stands for the Af(m,s 2 ) distribution truncated to 
the interval [a, b]. 

Several effi cient algorithm s exists for sampling from a truncated Gaussian dis- 
tribution, see IChopinl (|201lh . We focus on the marginal of W(l). We derive an 
efficient independent Metropolis-Hastings step for W(l) based on a Af(m, s 2 ) pro- 
posal distribution. The acceptance rate reads: 

p u (w'(l))Af(w(l);m, S 2 ) 
p u (w(l))Af(w'(l);m,s 2 ) 

where w(l) and w'(l) denote, respectively, the current value and the proposed value 
W'(l) ~ J\f(m,s 2 ). The main issue is to derive a method for calculating a good 
Gaussian approximation Af(m, s 2 ) of p(w(l)). 

The Gaussian approximation J\f(m, s 2 ) is obtained iteratively. At each iteration, 
we use the following crude approximation: the function $(x) is replaced by constant 
one for x > 0, and by function Af(x; 0, 1) for x < 0. The latter approximation is 
justified by the fact that, for x — > — oo, x$(x)/Af(x; 0, 1) — > — 1 quickly. 

At first iteration, set (m, s 2 ) = (mu(l), 1). Then repeat the following steps: se- 
lect the factor i with largest fi(i) and multiply the current Gaussian approximation 
J\f(x; mo, s§) by either the density J\f(x; fx(i), 1) if fi(i) > mo, or by 1 otherwise. Dis- 
card factor i and repeat this procedure until all M — 1 factors have been accounted 
for. Set (m, s 2 ) to be the mean and variance of this resulting proposal. 

To refine this proposal, perform several Newton-Raphson iterations for finding 
the mode and the curvature of the mode of logp(w(l)) by using (m, s 2 ) as the 
starting values. All these operations take very little time, and leads to an acceptance 
rate close to one in most cases. This program is available upon request. 

7.3. Implementing Step 2 of Algorithm [2] . The density (|2"0)) can be written 
as 

T 

N (v; Ojv-i, kI N -i - kA _1 1 A t_i1^_ 1 ) - Ri*/z{(v + z 2 1n) ; Om , ziIm) 

i=l 

(29) 

x N(z 2 ;Q, KN- l )IG{z x ;a,b). 
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By implementing the change of variable 




(|2"§1) becomes 



T 



A/"(w;0 A r,KZi/ A r)Jj7V(w- - R i u;0 M ,z 1 I M ) x IG(zf,a,b). 



Sampling (U, Z\) is now straightforward: 



Z x ~ IG(— +a,b + SSR/2 + if/2), U\Z X = « x ~ JV(— S^iFw, S" 1 ). 



where 



SSi? = w T w - w T R(R T R)- 1 R T w, wls = (R T R)- 1 R T w, 



# = uIsVnk + (R T Ry 1 y 1 u LS , s = /jv«r 1 « _1 + zr 1 (-R T ^)- 

and w T = [(u4) T , . . . , («40 T 1 , i? T = ■ • ■ , • Here uls refers to the least 
squares estimate of u and SSR is the minimum mean-squared error. To recover 
(V,Z 2 ) from (U,Zi), let it denote the sampled random vector U, then (Z 2 ,V) is 
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